Music consists of strings of sound that vary over time. Technical devices, such as tape 
recorders, store musical melodies by transcribing event times of temporal sequences 
into consecutive locations on the storage medium. Playback occurs by reading out the 
stored information in the same sequence. However, it is unclear how the brain stores and 
retrieves auditory sequences. Neurons in the anterior lateral belt of auditory cortex are 
sensitive to the combination of sound features in time, but the integration time of these 
neurons is not sufficient to store longer sequences that stretch over several seconds, 
minutes or more. Functional imaging studies in humans provide evidence that music 
is stored instead within the auditory dorsal stream, including premotor and prefrontal 
areas. In monkeys, these areas are the substrate for learning of motor sequences. 
It appears, therefore, that the auditory dorsal stream transforms musical into motor 
sequence information and vice versa, realizing what are known as forward and inverse 
models. The basal ganglia and the cerebellum are involved in setting up the sensorimotor 
associations, translating timing information into spatial codes and back again. 
MUSICAL MELODIES AS SEQUENCES AND OBJECTS 

Musical melodies are sequences of sound with particular rhythm, 
loudness and timbre. As such, they are concatenations of dis- 
crete elements over time, which can continue for seconds or 
minutes. However, we can learn to recognize melodies as a 
single entity, as we recognize extended objects in either the 
visual or auditory modality, and we can assign a name to 
them ("Twinkle, twinkle, little star" or "Yankee doodle"). In this 
more holistic view, a melody is an entity that requires inte- 
gration of its elements over time and, ultimately, coding by a 
specific, limited ensemble of neurons in the brain. This latter 
representation is likely to be situated in the auditory ventral 
stream, where representations of "auditory objects" have been 
found (Tian et al., 2001; Zatorre et al., 2004). In a hierarchical
model, information about spectral structure and temporal
modulation, including pitch, is stored in early ventral
areas and in core (Leaver and Rauschecker, 2010; Schindler 
et al., 2013); higher-order object information, e.g., about timbre, 
which would reveal the identity of an instrument or singer, 
is most likely found in the anterior-most regions of superior 
temporal cortex (Leaver and Rauschecker, 2010) and in ven- 
trolateral prefrontal cortex (Cohen et al., 2009; Plakke et al., 
2013). Even in the most hierarchical model, however, it seems 
unlikely to find single neurons responding selectively to lengthy 
melodies, just as it seems unreasonable to expect single neu- 
rons to respond to specific sentences in the language domain. 
So how is the identity of a sound sequence established in the 
brain? 



For speech, regions in the anterior superior temporal cortex 
(aSTC) have been found that respond to phonemes or words, 
including short standard phrases (DeWitt and Rauschecker, 
2012), but not to whole sentences. The latter would seem to 
reside in the auditory dorsal stream instead, where represen- 
tations of sequences have been found (Schubotz et al., 2004). 
Activation of dorsal-stream regions, including supplementary and 
pre-supplementary motor areas (SMA, pre-SMA) or ventral and 
dorsal premotor cortex (vPMC, dPMC), has also been reported 
during singing (Perry et al., 1999), listening to music (Chen et al., 
2008), and during anticipatory imagery of music (Leaver et al., 
2009). 

But how does the storage process of lengthy sound sequences 
really happen? This is not at all a trivial question, and the brain 
mechanisms governing the processing, storage and retrieval of 
sequences are far from understood. It may be advantageous, 
therefore, to briefly consider how technical devices do this. 

HOW TAPE RECORDERS WORK 

A tape recorder is an audio storage device that records and plays 
back sounds, including music and speech, using magnetic tape 
as a storage medium. It records a fluctuating audio signal by 
moving the tape across a "tape-head" that polarizes the magnetic 
domains in the tape in proportion to the audio signal (modified 
from Wikipedia). Electric current flowing in the tape-head creates 
a fluctuating magnetic field, which causes the magnetic material 
on the tape to align in a manner proportional to the original 
signal, as the tape is moving past the head. The original signal can 
be reproduced by running the tape back across the tape head, 
where the reverse process occurs: the magnetic imprint on the 
tape induces a small current in the reading head, which approxi- 
mates the original signal and is then amplified for playback on a 
loudspeaker (from Wikipedia). 

Thus, a tape recorder stores music by moving a storage 
medium (the tape) past a device (the head) that represents the 
sound waves in the form of a fluctuating electro-magnetic field. 
A turntable or CD player follows the same principle of using the 
movement of a recording medium to translate time into space, 
this time in the form of a spiral track. In all cases, the recording 
process can be inverted into a playback process by the reverse 
mechanism, moving the recorded medium past the reading device 
at the same speed, thus recreating the original signal. 

The important message to be gleaned from this is that technical 
devices store musical melodies (as well as other sequences) by 
re-coding time of occurrence into spatial positions. Further- 
more, storage and retrieval of the sequence utilize the same 
mechanism, differing only by inversion. Applied to the brain, 
it is attractive to think that information is stored in the same 
places where the original activation takes place, and that record- 
ing and read-out are also accomplished by similar, but inverse 
mechanisms. But how is the order of events in a time sequence 
preserved? At first glance, the only way to form temporal associations 
between stored items would seem to be by "chaining" the events 
together, whereby one event becomes the cue for the next one 
(Ebbinghaus, 1964). Read-out takes the form of cued recall. 
Although this idea has been criticized (Lashley, 1951; Terrace, 
2005), it still provides one possible mechanism for storing a 
sequence, but it remains unclear how it is implemented in the 
brain. 
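
The chaining idea can be made concrete with a minimal sketch. It is purely illustrative and makes no claim about neural implementation; the function names `learn_chain` and `cued_recall` are hypothetical. Each item is stored as the cue for its successor, and read-out proceeds by cued recall. The second example shows one well-known difficulty for simple chaining: when an item recurs within a sequence, a single cue can no longer specify a unique successor.

```python
def learn_chain(sequence):
    """Store a sequence as cue -> next-item associations (simple chaining)."""
    # A repeated item overwrites its earlier association, which is the
    # weakness illustrated in the second example below.
    return {cue: nxt for cue, nxt in zip(sequence, sequence[1:])}


def cued_recall(chain, start, max_steps=100):
    """Read out a sequence by letting each recalled item cue the next one."""
    recalled = [start]
    while recalled[-1] in chain and len(recalled) < max_steps:
        recalled.append(chain[recalled[-1]])
    return recalled


# A sequence without repeated items is recalled correctly.
unique_notes = ["C", "D", "E", "F", "G"]
recalled_unique = cued_recall(learn_chain(unique_notes), "C")

# A melody with repeated notes (opening of "Twinkle, twinkle, little star")
# cannot be recovered from item-to-item associations alone.
twinkle = ["C", "C", "G", "G", "A", "A", "G"]
recalled_twinkle = cued_recall(
    learn_chain(twinkle), "C", max_steps=len(twinkle)
)
```

In this toy version the cue is a single item; richer cues (for example, the preceding two items, or item plus position) would reduce the ambiguity, at the cost of a more complex association.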

Obviously, unlike a tape recorder or CD player, the brain does 
not have any moving parts for the translation of time into space. 
Then again, digital storage devices (solid-state or flash drives) no 
longer require moving tapes or spinning discs. These devices store 
audio as a stream of numbers representing the amplitude of the 
audio signal at equal time intervals. The numbers get stored in 
the order they are received, and a "controller" assures that they are 
read out in the same order later. This form of storing a sequence 
requires a positional code, i.e., the re-coding of event time into 
position in space, something that has been postulated variously 
for models of short-term memory as well (Henson and Burgess, 
1997). 
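
As a hedged illustration of such a positional code, the following sketch stores amplitude samples taken at equal time intervals in a list, so that the list index stands in for time of occurrence, and reads them out in the same order. The helper names `record` and `play_back`, the 440 Hz test tone, and the sample rate are assumptions chosen for the example, not a description of any particular device.

```python
import math


def record(signal, sample_rate, duration):
    """Sample signal(t) at equal intervals; the list index encodes time."""
    n_samples = round(sample_rate * duration)
    return [signal(i / sample_rate) for i in range(n_samples)]


def play_back(memory, sample_rate):
    """Read out in stored order, converting position back into time."""
    return [(i / sample_rate, amplitude) for i, amplitude in enumerate(memory)]


def tone(t):
    """A 440 Hz sine wave used as a stand-in for the audio signal."""
    return math.sin(2 * math.pi * 440.0 * t)


memory = record(tone, sample_rate=8000, duration=0.01)
playback = play_back(memory, sample_rate=8000)
```

Recording and read-out use the same mapping between position and time, applied in opposite directions, which mirrors the inversion described above for tape.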

In summary, technical devices universally store sequences by 
re-coding time of occurrence into spatial positions, and the 
fundamental question arises: How does the brain translate tem- 
poral events in a sequence into spatial patterns or a spatial 
gradient? 

NEURAL MECHANISMS FOR THE ENCODING OF SEQUENTIAL 
ORDER 

TEMPORAL COMBINATION SENSITIVITY AS THE MOST ELEMENTARY 
MECHANISM 

Put most simply, music consists of two essential elements: 
frequency (or pitch) and rhythm. However, while rhythm (duration 
of tones and the intervals between them) is obviously 
important, we can still recognize a melody (within limits) even 
when rhythmic elements are omitted. Recent results confirm that 
pitch and rhythm are indeed processed and stored independently 
(Schellenberg et al., 2014). Thus, the most essential element for 
the recognition of a melody is the order of the notes it consists of. 
If that order is changed, or the melody is played in reverse, recog- 
nition is impaired or fails altogether. Again, there is commonality 
between music and language (cf. Patel, 2008; Patel and Iversen, 
2014), as language comprehension also becomes impossible when 
its elements are played in reverse (either at the word or sentence 
level) (Bornkessel-Schlesewsky and Schlesewsky, 2013). 

A neural mechanism that is commonly invoked for imple- 
menting this reversal sensitivity is the combination of inputs 
over time (temporal combination sensitivity, TCS). Just as in its 
twin mechanism, spectral combination sensitivity (Margoliash 
and Fortune, 1992), the target neuron acts as a logical AND- 
gate which fires only if several inputs are active simultaneously. 
In the time domain, delay lines can be used to hold up some 
of the inputs long enough until all other inputs have arrived 
(Figure 1). These asymmetric delays have the effect of creating 
selectivity for temporal order on a short time scale, on the order 
of hundreds of milliseconds. Temporally asymmetric delays 
can thus be created by spatial asymmetries on a miniature scale, similar 
to direction selectivity in the visual system. This mechanism 
creates frequency-modulation (FM) detectors with pronounced 
selectivity for the direction of an FM sweep (Tian and Rauschecker, 
2004; see also Tian et al., 
2013 for further analogies between elemental detectors in visual 
and auditory cortex). 
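
The delay-line and AND-gate scheme can be sketched in a few lines. This is a toy model in discrete time steps under strong simplifying assumptions (binary inputs, exact coincidence, hypothetical channel names "A", "B", "C"); it is not a model of real belt neurons. Each input channel is delayed by a fixed amount, and the target unit fires only at times when all delayed inputs arrive together.

```python
def tcs_firing_times(events, delays):
    """Return the time steps at which all delayed inputs coincide.

    events: dict mapping channel name -> list of onset times (time steps)
    delays: dict mapping channel name -> delay (time steps) of its delay line
    """
    arrival_sets = [
        {onset + delays[channel] for onset in events.get(channel, [])}
        for channel in delays
    ]
    # Logical AND: the unit fires only when every channel delivers input
    # at the same time step.
    return sorted(set.intersection(*arrival_sets))


# Delay lines tuned to the order A -> B -> C with one step between elements.
delays = {"A": 2, "B": 1, "C": 0}

forward = {"A": [0], "B": [1], "C": [2]}
reverse = {"C": [0], "B": [1], "A": [2]}
single_element = {"A": [0]}

fires_forward = tcs_firing_times(forward, delays)
fires_reverse = tcs_firing_times(reverse, delays)
fires_single = tcs_firing_times(single_element, delays)
```

With these delays the forward sequence produces coincident arrival at step 2, whereas the reversed sequence and an isolated element never do, which is the qualitative pattern of order selectivity described in the text.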

PREMOTOR AREAS AS SEQUENCING MACHINES 

While the above TCS mechanism works well at durations cor- 
responding to syllable or word level, it breaks down when the 
strings of sound become longer. Under those circumstances, one 
may assume that chaining mechanisms come into play, where 
the end of one short sequence triggers the beginning of the 
next, as in a game of dominoes. Such mechanisms have been 
postulated in particular for the motor system, where the execution 
of smooth movements requires precise timing and order of muscle 
activations. Brain substrates that play a role in the learning, 
planning and execution of such sequential behavior are thought to 
be the cerebellum, the striatum, and various regions of premotor 
and prefrontal cortex (Hikosaka et al., 1996; Sakai et al., 1999; 
Fuster et al., 2000; Yin, 2010). While premotor and prefrontal 
areas are most important for planning and execution, cerebellum 
and basal ganglia are involved at different stages of learning of a 
motor sequence. In particular, cerebellum and striatum differ by 
the time scales they apply to the transformation of temporal into 
spatial patterns. 

It is important to keep in mind that music is often created 
by another person making it. That is, someone is producing 
the music before we can listen to it, and a melody is first and 
foremost a motor sequence that happens to produce sounds. 
This is true even if we produce the music ourselves. We produce 
music by virtue of activating muscles that move our vocal cords, 
lips and jaws (during singing or whistling) or, depending on the 
type of musical instrument played, we move our arms, fingers, 
feet, and sometimes our lips in coordination with our breathing 
apparatus. (This is similar again in speech, where we learn to 
produce a sound by moving the muscles of the lips, tongue, etc., 
in coordination with the vocal cords and breathing muscles.) 
Hearing another person produce these sounds may trigger the 
same or similar muscle movements, with the goal of producing 
the same sounds. This may happen either as a form of imitation, 
or directly as a result of sensorimotor interaction that, by 
necessity, intertwines perception and action during the 
production of these sounds. In other words, the feedback from 
hearing (and to some extent proprioception) is a necessary 
prerequisite for normal production of producible sounds. The 
process can best be appreciated in reference to speaking or 
singing, where we have the same "instrument", our vocal 
apparatus, at our disposal as the models we are trying to 
emulate. However, even when listening to a musical instrument 
that we are not capable of playing ourselves, we can produce the 
same melody by generating tones in the same order and with the 
same timing as the ones we listen to. 

FIGURE 1 | Auditory direction selectivity of a cortical neuron as 
cellular basis for sequence selectivity. (A) Schematic drawing of a 
neuron in the lateral belt of rhesus monkey auditory cortex, illustrating 
temporal combination sensitivity (TCS). Input from lower-order neurons is 
integrated at the level of the lateral belt in a nonlinear fashion 
(Rauschecker et al., 1995). The belt neuron acts as a logical AND-gate 
and fires only if the membrane potential surpasses a given threshold. 
Temporal delay lines generate order sensitivity such that a sound 
sequence will excite the neuron only if presented in a specific order 
(from Rauschecker, 2012). (B, C) Example of a response by a neuron in 
the lateral belt to a species-specific vocalization. Spectrograms of the call 
and its temporal components are shown in (B) together with the 
reversed call (on extreme right). The neuron's response (shown in (C)) to 
individual "syllables" and to the reversed call is strongly diminished. 

FIGURE 2 | Participation of auditory dorsal stream in coding of musical 
sequences. (A) Activation of areas in the auditory dorsal stream by 
anticipation of familiar music. Activated areas include the supplementary 
and pre-supplementary motor areas (SMA, pre-SMA), the inferior parietal 
lobule (IPL), posterior cingulate cortex (PCC), globus pallidus and putamen 
(GP/Pu) of the basal ganglia, and the cerebellum (CB) (from Leaver et al., 
2009). (B) Illustration of the auditory ventral and dorsal streams in the 
human brain (modified from Rauschecker and Scott, 2009). This expanded 
model originated from the original dual-pathway model of Rauschecker and 
Tian (2000) by generalizing the role of the dorsal stream to one of 
sensorimotor integration and control, which includes processing of space 
and motion as well as storage and retrieval of sound sequences, the latter 
especially relevant for processing of music. 

It will be interesting to find out when this ability to reproduce 
sound sequences first develops. Although young infants have the 
ability to recognize familiar melodies as early as 2 months of age, 
they do not develop relative pitch until ~6 months of age and not 
without exposure to music (Plantinga and Trainor, 2009). 

SENSORIMOTOR LEARNING IN NONHUMAN PRIMATES 

Studies in monkeys have shown that during learning of a new sen- 
sorimotor association the basal ganglia are very active (Pasupathy 
and Miller, 2005). The same has been shown by functional imaging 
studies in humans who are learning new sequences (Leaver 
et al., 2009; Yin, 2010). These results assign a role to the basal 
ganglia in the chaining or stitching together of new sensorimotor 
associations or, more succinctly, in the transformation of tem- 
poral order information into a spatial code (Kalm and Norris, 
2014). After a sequence is well learned, activation of premotor and 
prefrontal regions becomes increasingly prominent, while basal 
ganglia activation weakens (Figure 2A; Leaver et al., 2009). This 
reflects the formation of chunks of sequence items, consistent 
with human learning and imaging studies (Janata and Grafton, 
2003), which are stored in frontal areas like pre-SMA and SFG 
(Sakai et al., 1999; Sakai and Passingham, 2003). The activation 
moves more rostrally as the sequence becomes more familiar 
(Leaver et al., 2009). This is consistent with a caudal-to-rostral 
hierarchy within prefrontal cortex (Badre and D'Esposito, 2009), 
where rostral areas control activity in more caudal modality- 
specific areas (Sakai and Passingham, 2003). 

It is currently unclear if it is possible to learn a new melody 
or sequence without engaging these sensorimotor mechanisms by 
just passively listening to it. As a melody becomes increasingly 
familiar, it often becomes impossible to suppress the urge to sing 
along. While the learning of a new song or a new piece played 
on an instrument results in the building of "muscle memory" 
by tuning the motor and premotor structures of the brain, this 
may not happen in individuals who lack the corresponding skills. 
It would be interesting to see if there are certain forms of amusia 
in which the ability to reproduce or recognize music is lacking, 
and whether this is actually a weakness of sensorimotor memory 
that also affects the general ability to remember sequences 
(cf. Tremblay-Champoux et al., 2010). Interestingly, some forms 
of congenital amusia involve structural changes in the inferior 
frontal region (Hyde et al., 2007), but more research is needed to 
possibly tie these changes to a domain-general deficit in sequence 
processing. 

SINGING IN BIRDS 

Vocal learning is not unique to humans. It is common in a variety 
of animal species (Patel and Iversen, 2014), especially birds. Some 
songbird species (such as zebra finches or starlings) learn their 
melodies from a conspecific teacher, usually their father (Comins 
and Gentner, 2010; Adret et al., 2012); others (such as parrots or 
bullfinches) can also imitate words or melodies they hear from 
humans (Eda-Fujiwara et al., 2012; Nicolai et al., 2014). 

A wealth of neurobiological studies in several songbird species 
suggests that their neural apparatus for audio-motor learning 
is quite similar in principle to that of humans and nonhuman 
primates, consisting of premotor-basal-ganglia circuits that work 
in conjunction with higher auditory centers to encode the mem- 
orized songs (Achiro and Bottjer, 2013). In particular, recent data 
from zebra finches show that vocal motor circuits also partici- 
pate in the encoding of auditory experience of the vocal model 
(Roberts and Mooney, 2013). Thus, a universal circuit model is 
beginning to emerge from these comparative studies that might 
ultimately lead to an understanding of storage and retrieval of 
sound sequences in biological systems. 

SYNTHESIS: MELODIES IN VENTRAL AND DORSAL STREAMS 

Much evidence suggests that the dual auditory processing streams 
originally postulated for the monkey (Rauschecker, 1997, 1998; 
Romanski et al., 1999; Rauschecker and Tian, 2000) also exist in 
humans. The ventral auditory stream is important for the encod- 
ing of complex spectral information, including pitch (Bendor 
and Wang, 2005), and ultimately for the identification of sound 
objects. The dorsal stream was originally defined by its involve- 
ment in auditory spatial processing (Rauschecker and Tian, 
2000) and movement in space (Warren et al., 2002). This is 
still believed to be correct (Rauschecker, 2012), but the role of 
the dorsal stream has been expanded to include sensorimotor 
integration and control in more general terms (Rauschecker and 
Scott, 2009; Rauschecker, 2011), including the representation of 
sequences. 

A particularly interesting and important feature of the 
expanded dorsal stream is that it represents both inverse and 
forward models (Figure 2B). The forward model is what has 
classically been referred to as an "efference copy" (von Holst and 
Mittelstaedt, 1950; Troyer and Doupe, 2000). Whenever premotor 
cortex neurons fire in preparation for an action, they not only send 
their message towards the motor cortex for potentially real action, 
but they also inform sensory systems about the consequences of 
this action. Conversely, an inverse model (Grush, 2004) instructs 
the motor system about sensory signals that are relevant for 
reaching its goals. Both of these signals are compared within the 
dorsal stream, presumably in parietal cortex, and play a role in 
optimal state estimation by minimizing the resulting error signal 
(Rauschecker and Scott, 2009). 
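
The interplay of the two models can be illustrated with a deliberately simple control-loop sketch. Everything in it is an assumption made for the example: the "vocal apparatus" is reduced to a single unknown gain that maps a motor command to a produced pitch, and the function name `simulate_vocal_loop`, the learning rate, and the target value are hypothetical. The inverse model converts a desired sound into a motor command, the forward model predicts the sound that command should produce, and the mismatch between predicted and heard sound is used to correct the internal estimate.

```python
def simulate_vocal_loop(true_gain=2.0, target=440.0, gain_estimate=1.0,
                        rate=0.5, steps=20):
    """Toy forward/inverse model loop that minimizes a prediction error."""
    errors = []
    for _ in range(steps):
        # Inverse model: desired sensory outcome -> motor command.
        command = target / gain_estimate
        # Forward model (efference copy): predicted sensory consequence.
        predicted = gain_estimate * command
        # Actual auditory feedback from the (unknown) vocal apparatus.
        heard = true_gain * command
        # Error signal from comparing prediction with feedback.
        error = heard - predicted
        # Use the error to update the shared internal estimate.
        gain_estimate += rate * error / command
        errors.append(abs(error))
    return gain_estimate, errors


learned_gain, error_history = simulate_vocal_loop()
```

In this sketch both models share one internal parameter, so reducing the forward-model error also improves the inverse model; whether and how the brain couples the two is left open here.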

The ability of posterior parietal cortex to perform transforma- 
tions in space may also come to bear in terms of melodic "space". 
We can easily recognize a melody when it is played in a different 
key, that is, when pitch relations between notes are preserved. An 
imaging study contrasting a transposed melody to the original 
melody revealed greater activation in the intraparietal sulcus (IPS; 
Foster and Zatorre, 2010), which points to the role of the IPS in 
subtracting the effects of the transposition. 
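
Why preserved pitch relations suffice for recognition can be shown with a short sketch. It assumes melodies are given as MIDI note numbers (an assumption for the example; the helper names `intervals` and `same_melody` are hypothetical) and makes no claim about how the IPS performs the computation. Taking differences between successive notes removes the constant offset introduced by a transposition, so the original and the transposed melody yield the same interval sequence, whereas a reordering of the same notes does not.

```python
def intervals(midi_pitches):
    """Successive pitch differences in semitones (a relative-pitch code)."""
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]


def same_melody(melody_a, melody_b):
    """Compare two melodies by their interval sequences only."""
    return intervals(melody_a) == intervals(melody_b)


# Opening of "Twinkle, twinkle, little star" in C major (MIDI numbers).
original = [60, 60, 67, 67, 69, 69, 67]
# Same melody transposed up a perfect fourth (5 semitones).
transposed = [pitch + 5 for pitch in original]
# Same notes as the original, but in a different order.
reordered = [60, 67, 60, 67, 69, 67, 69]

matches_transposed = same_melody(original, transposed)
matches_reordered = same_melody(original, reordered)
```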

Finally, the question arises whether musical melodies, once 
they are learned, are simply defined by their existence as concate- 
nated sequences in sensorimotor regions of the auditory dorsal 
stream. The fact that they can be sung or played, imagined 
and anticipated almost automatically on a given cue seems to 
demonstrate that this is indeed the case. However, as mentioned 
in the Introduction section, we can also put a name or a label 
on a familiar melody, which suggests that there is a second form 
of existence for music in the brain besides concatenated sounds. 
The "chunks" formed in rostral prefrontal cortex that become 
apparent in fMRI studies of highly familiar music may be the 
endpoint of the sequencing process in the dorsal stream. At the 
same time, however, they may also be the starting point of a 
feedback process (via the inferior frontal cortex) into the ventral 
auditory pathway, where more information is added, for instance, 
about the timbre of musical instruments playing a specific tune 
or about its emotional connotations. This object-identification 
process would enable a musical melody not just to receive a name, 
but also to trigger memories of all things past that are associated 
with that melody. 